{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import json\n",
    "import uuid\n",
    "import datetime\n",
    "import pandas as pd\n",
    "from lxml import etree\n",
    "from collections import defaultdict, Counter\n",
    "from tqdm import tqdm_notebook as tqdm\n",
    "import os\n",
    "import matplotlib.pyplot as plt\n",
    "import statistics\n",
    "import numpy as np\n",
    "from eMammal_helpers import clean_species_name, get_total_from_distribution, sort_dict_val_desc, plot_distribution, plot_histogram"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print all outputs in a cell\n",
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = \"all\"\n",
    "\n",
    "# auto reload external Python modules\n",
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "# display Matplotlib figures inline and set default size\n",
    "%matplotlib inline\n",
    "plt.rcParams['figure.dpi'] = 120\n",
    "plt.rcParams['figure.figsize'] = (8.0, 3.0)\n",
    "plt.rcParams['image.interpolation'] = 'nearest'\n",
    "plt.rcParams['image.cmap'] = 'gray'\n",
    "plt.rcParams['font.size'] = 9"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# eMammal dataset stats\n",
    "\n",
    "Contact: Siyu Yang <cameratraps@lila.science>\n",
    "\n",
    "Run this notebook with Python 3.\n",
    "\n",
    "This notebook inspects the eMammal dataset we have in August 2018. A subset of this is sent to iMerit for bounding box annotations. The CSV at `csv_path` contains all the images that have been sent to iMerit for annotations in the first batch.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configurations and paths\n",
    "output_dir_path = '/home/yasiyu/yasiyu_temp'\n",
    "\n",
    "csv_path = '/home/yasiyu/scripts/input/emammal_2018.08.20.csv'  # csv specifying the images sent for annotation\n",
    "\n",
    "deployments_path = '/datadrive/emammal'\n",
    "\n",
    "\n",
    "# constants\n",
    "_people_tags = {\n",
    "    'Bicycle',\n",
    "    'Calibration Photos',\n",
    "    'Camera Trapper',\n",
    "    'camera trappper',\n",
    "    'camera  trapper',\n",
    "    'Homo sapien',\n",
    "    'Homo sapiens',\n",
    "    'Human, non staff',\n",
    "    'Human, non-staff',\n",
    "    'camera trappe',\n",
    "    'Human non-staff',\n",
    "    'Setup Pickup',\n",
    "    'Vehicle'\n",
    "}\n",
    "PEOPLE_TAGS = {x.lower() for x in _people_tags}\n",
    "\n",
    "_no_animal_tags = {'No Animal', 'no  animal', 'Time Lapse', 'Camera Misfire', 'False trigger', 'Blank'}\n",
    "NO_ANIMAL_TAGS = {x.lower() for x in _no_animal_tags}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "The original images and classification annotations are in the `emammal` container in the `wildlifeblobssc` storage account in the AI for Earth Development subscription. The `emammal` container holds collections named after the researcher responsible for them and a number indicating the batch. These were downloaded to the `bobcat` VM's 2TB data disk at `/datadrive/emammal`, forgoing the collection folder level. Scripts for downloading them to the data disk is at `database_tools/copy_and_unzip_emammal.py`. \n",
    "\n",
    "At `/datadrive/emammal`, each folder is one deployment. Each deployment folder contains the sequences of images and a .xml file with information such as timestamp and animal species labels.\n",
    "\n",
    "There are a total of 3140 deployments in the 0McShea and 0Kays collections, and 126 in the 0Long collection.\n",
    "\n",
    "A deployment is a set of image sequences specific in space and time. So a site name could be shared by multiple deployments. I suspect there is an issue with the two projects from China (p193 and p195) with how they name their deployments - they seem to have named some of them with the site name."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First batch of annotations\n",
    "\n",
    "Here we load the csv and an example xml file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# did not work with 'utf-8'\n",
    "data = pd.read_csv(csv_path, encoding='ISO-8859-1', header=0,\n",
    "                   names=['projectID', 'projectName', 'deploymentID', 'siteName', 'speciesPresentCommon', 'imageID', 'annotationSetFileName']) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.sample(n=10)\n",
    "# note that speciesPresentCommon is ; separated when there are more than one species present"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(data)  # 18418 images are getting bbox annotations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.loc[3, 'annotationSetFileName']  # annotationSetFileName example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted(data.projectID.unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Project p126 is not found in blob storage..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[data.projectID == 'p126']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_annotated_img = data['annotationSetFileName']\n",
    "frames = [x.split('.frame')[1].split('.')[0] for x in all_annotated_img]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "possible_frames = set(frames)\n",
    "possible_frames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the seq ID can just be `s11` without the deployment ID `d17432` prefix. But that's how it is in the `<ImageSequenceId>` item.\n",
    "\n",
    "\n",
    "No \"human\" images were included in the annotation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_xml_path = '/datadrive/emammal/4180d18095/deployment_manifest.xml'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(sample_xml_path, 'r') as f:\n",
    "    tree = etree.parse(sample_xml_path)\n",
    "root = tree.getroot()\n",
    "for child in root:\n",
    "    print(child)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree.findtext('ProjectId')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree.find('CameraDeploymentID').text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sequence = tree.find('ImageSequence')\n",
    "for child in sequence:\n",
    "    print(child)\n",
    "\n",
    "print('')\n",
    "\n",
    "# species is only identified at the sequence level, not image level\n",
    "research_ids = sequence.find('ResearcherIdentifications')\n",
    "example_id = research_ids[0]\n",
    "for c in example_id:\n",
    "    print('{} - {}'.format(c, c.text))\n",
    "\n",
    "\n",
    "print('\\nImage has the following fields. The ImageIdentifications is empty.')    \n",
    "for child2 in child:\n",
    "    print(child2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset stats\n",
    "\n",
    "All of the result below (apart from where noted) are based on the entire eMammal dataset, not just the images that were sent for annotation.\n",
    "\n",
    "It is quite fast (~30 seconds) to run the cell below, reading the 3140 xml files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# What species are there\n",
    "species_tally = defaultdict(int)\n",
    "\n",
    "# How many animals of the same species are there in each sequence - the content of <Count>\n",
    "# Note that even if the identification is \"empty\"/\"no animal\", the <Count> tag will still be 1\n",
    "animal_counts = []\n",
    "\n",
    "# How many <Identification> items are there in each sequence\n",
    "num_identifications = []\n",
    "\n",
    "# Sequence ID for all sequences that have more than one <Identification> : What species are present\n",
    "seq_with_multi_ids = {}\n",
    "\n",
    "# How many sequences are there in each deployment\n",
    "num_sequences_in_d = []\n",
    "\n",
    "# How many images are in each sequence\n",
    "num_images_in_seq = []\n",
    "\n",
    "\n",
    "# TODO Could get these info for each project separately.\n",
    "\n",
    "\n",
    "total_num_deployments = len(os.listdir(deployments_path))\n",
    "\n",
    "for deployment in tqdm(os.listdir(deployments_path)):\n",
    "    deployment_path = os.path.join(deployments_path, deployment)\n",
    "    manifest_path = os.path.join(deployment_path, 'deployment_manifest.xml')\n",
    "    \n",
    "    with open(manifest_path, 'r') as f:\n",
    "        tree = etree.parse(f)\n",
    "    \n",
    "    root = tree.getroot()\n",
    "    project_id = root.findtext('ProjectId')\n",
    "    deployment_id = root.findtext('CameraDeploymentID')\n",
    "    \n",
    "    image_sequences = root.findall('ImageSequence')\n",
    "    num_sequences_in_d.append(len(image_sequences))\n",
    "    \n",
    "    for sequence in image_sequences:\n",
    "        images = sequence.findall('Image')\n",
    "        num_images_in_seq.append(len(images))\n",
    "        \n",
    "        # get species info\n",
    "        seq_id = sequence.findtext('ImageSequenceId')\n",
    "        full_seq_id = 'datasetemammal.project{}.deployment{}.seq{}'.format(project_id, deployment_id, seq_id)\n",
    "        \n",
    "        researcher_identifications = sequence.findall('ResearcherIdentifications')\n",
    "        \n",
    "        for researcher_id in researcher_identifications:\n",
    "            identifications = researcher_id.findall('Identification')\n",
    "            num_identifications.append(len(identifications))\n",
    "            multi_id_flag = True if len(identifications) > 1 else False\n",
    "                \n",
    "            species = []\n",
    "\n",
    "            for id in identifications:\n",
    "                species_common_name = clean_species_name(id.findtext('SpeciesCommonName'))\n",
    "                species_tally[species_common_name] += 1\n",
    "                \n",
    "                species.append(species_common_name)\n",
    "\n",
    "                count = id.findtext('Count')\n",
    "                animal_counts.append(int(count))\n",
    "                \n",
    "            if multi_id_flag:\n",
    "                seq_with_multi_ids[full_seq_id] = species"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Number of images, sequences and deployments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of sequences in a deployment\n",
    "print('Total number of sequences in the dataset: {}'.format(sum(num_sequences_in_d)))\n",
    "print('Median of {:.0f} sequences in a deployment, average of {:.2f}, min {:.0f}, max {:.0f}'.format(\n",
    "    np.median(num_sequences_in_d), \n",
    "    np.mean(num_sequences_in_d),\n",
    "    min(num_sequences_in_d),\n",
    "    max(num_sequences_in_d)))\n",
    "plot_histogram(num_sequences_in_d, 'Histogram of the number of sequences in deployments')\n",
    "plot_histogram(num_sequences_in_d, 'Histogram of the number of sequences in deployments, max=200', max_val=200)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of images/frames in a sequence\n",
    "print('Total number of images in the dataset: {}'.format(sum(num_images_in_seq)))\n",
    "print('Median of {:.0f} images in a sequence, average of {:.2f}, min {:.0f}, max {:.0f}'.format(\n",
    "    np.median(num_images_in_seq), \n",
    "    np.mean(num_images_in_seq),\n",
    "    min(num_images_in_seq),\n",
    "    max(num_images_in_seq)))\n",
    "\n",
    "plot_histogram(num_images_in_seq, 'Histogram of the number of images in sequences, max=40', max_val=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verified that this number of images are present as .jpg:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_num_images = 0\n",
    "\n",
    "for deployment in tqdm(os.listdir(deployments_path)):\n",
    "    deployment_path = os.path.join(deployments_path, deployment)\n",
    "    content = os.listdir(deployment_path)\n",
    "    num_images = sum(1 for i in content if i.lower().endswith('.jpg'))\n",
    "    total_num_images += num_images\n",
    "print('Total of {} images found.'.format(total_num_images))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Species present"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_distribution(species_tally, title='Number of sequences with the species', top=30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sort_dict_val_desc(species_tally)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count tag `<Count>` \n",
    "\n",
    "An example where the count is 3 is `project3062d20814.deploymentd20814.sequenced20814s25`. There are 3 Northern Raccoons but no more than 2 appear in one image.\n",
    "\n",
    "Inspecting a few others of such, count doesn't seem particularly useful. One labeled with count 3 could be 2 animals that becomes invisible in a few frames. \n",
    "\n",
    "As noted before, even if the identification is \"empty\"/\"no animal\", the <Count> tag will still be 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Total number of Count item values in the dataset: {}'.format(sum(animal_counts)))\n",
    "print('Median of {:.0f} for the Count item, average of {:.2f}, min {:.0f}, max {:.0f}'.format(\n",
    "    np.median(animal_counts), \n",
    "    np.mean(animal_counts),\n",
    "    min(animal_counts),\n",
    "    max(animal_counts)))\n",
    "counter = Counter(animal_counts)\n",
    "print(counter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple animals or species in the same image sequence\n",
    "\n",
    "Some image sequences contain more than one `<Identification>` item in `<ResearcherIdentifications>`. There is also a `<Count>` for each `<Identification>`, which seems to mean how many animals of that species are there. This is a little inconsistent: sometimes both `<Identification>` are of the same species as you can see below.\n",
    "\n",
    "This is problematic because iMerit only labels bbox for one class 'animal', so when there are multiple bboxs in an image of different species, there's no way to label the bbox's species correctly (other than manually)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Total number of identification items in the dataset: {}'.format(sum(num_identifications)))\n",
    "print('Median of {:.0f} identification items per sequence, average of {:.2f}, min {:.0f}, max {:.0f}'.format(\n",
    "    np.median(num_identifications), \n",
    "    np.mean(num_identifications),\n",
    "    min(num_identifications),\n",
    "    max(num_identifications)))\n",
    "counter = Counter(num_identifications)\n",
    "print(counter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "152483 / 174277 = 87.5% images have just one identification item in the xml."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(seq_with_multi_ids.values())[:20]  # what species are there in sequences with multiple identifications"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_annotated_img = data['annotationSetFileName']\n",
    "all_annotated_seq = set([x.split('.frame')[0] for x in all_annotated_img])\n",
    "len(all_annotated_seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are all the species __in bbox annotated pictures__ with more than one identifications, that don't involve human or where all IDs are of the same species:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for seq_id, animals in seq_with_multi_ids.items():\n",
    "    if seq_id in all_annotated_seq:\n",
    "        if 'human' not in animals and len(set(animals)) > 1:\n",
    "            print(animals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So not that many images that are animals-only and contain more than one label. This is few enough to label manually."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notes\n",
    "\n",
    "Some other names used for denoting human and empty sequences:\n",
    "```\n",
    "other_human_tags = {\n",
    "    'camera  trapper', # extra whitespace or misspelt\n",
    "    'camera trapper ',\n",
    "    'camera trappper'\n",
    "}\n",
    "\n",
    "other_no_animal = {\n",
    "    'animal not on list',  # should map to 'unknown animal'?\n",
    "    'no  animal',  # extra whitespace\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bounding box annotations\n",
    "\n",
    "To come. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:py35]",
   "language": "python",
   "name": "conda-env-py35-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
